Abstract

The computational requirements for an adaptive solution of unsteady problems
change as the simulation progresses. This causes workload imbalance among
processors on a parallel machine which, in turn, requires significant data movement at
runtime. We present a new dynamic load-balancing framework, called JOVE, that
balances the workload across all processors with a global view. Whenever the
computational mesh is adapted, JOVE is activated to eliminate the load imbalance.
JOVE has been implemented on an IBM SP2 distributed-memory machine in MPI
for portability. Experimental results for two model meshes demonstrate that mesh
adaption with load balancing gives more than a sixfold improvement over one
without load balancing. We also show that JOVE gives a 24-fold speedup on 64
processors compared to sequential execution.

1. Introduction

Unsteady flow computation in complex three-dimensional
domains is a challenging task. It is particularly
daunting when dynamic mesh adaption is used on unstruc- 
tured grids. The computational requirements for such prob- 
lems are extremely large both in terms of processing time 
and in-core memory, and can only be satisfied by large- 
scale machines [6, 8]. During a typical adaptive, unsteady 
computational fluid dynamics (CFD) calculation, the un- 
structured meshes are locally refined and/or coarsened to 
capture important flow features. As a result, the computa- 
tional intensity is not only time dependent, but also varies 
spatially over the problem domain. 

A parallel implementation of such computational methods
on distributed-memory machines typically requires two
steps [8, 12]. First, the computational mesh is partitioned 
into smaller submeshes. Second, the partitioned submeshes 
are assigned to processors based on a mapping strategy. 
While this static partitioning and mapping approach is ad- 
equate for CFD calculations that do not change in com- 
putational intensity over time, it is grossly inefficient for 
unsteady, adaptive calculations. This is because as the com- 
putational behavior changes, some processors may have a lot 
more work than others. The imbalance in the processor loads 
implies that the initial partitioning of the mesh is no longer 
acceptable. It is thus imperative that the amount of work as- 
signed to each processor be balanced at runtime to increase 
processor utilization and improve performance [4, 8].

Balancing the runtime computational load, however, is
usually very difficult for several reasons. These include
obtaining a reliable measurement of the computational load,
limiting the amount of runtime data movement, and minimizing
interprocessor communication. Various methods for dynamic
load balancing have been reported to date by numerous
researchers; however, most of them lack a global view of
loads across processors. A systematic way of measuring
loads across all processors and then utilizing that information
to balance the workload is needed for a method to be
applicable to a variety of realistic applications. For example,
the local detection and balancing of loads only among neigh- 
boring processors may be inadequate for large scientific ap- 
plications as it could leave some processors unbalanced. At 
the same time, the redistribution of processor loads must be 
efficient so as not to significantly delay the main application. 
If parallel CFD is to be successful on distributed-memory 
multiprocessors for practical problems, it is essential that 
a dynamic load balancing method be developed in such a
way that all necessary modules can be combined together 
to collectively act as a coherent tool. Our purpose is to 
build such an environment for runtime load balancing with 
unstructured mesh adaption for unsteady CFD applications. 
The dynamic load balancer, called JOVE, is intended to
satisfy these requirements. It performs its task by compar- 
ing the computational gain for a balanced workload against 
the communication penalty arising from the data redistribu- 
tion. Each time the computational mesh is adapted, JOVE 
decides, based on the information collected from all processors,
whether repartitioning will be beneficial. If data movement
is expensive compared to the computational gain, the
mesh is not repartitioned and the CFD simulation continues 
without interruption. If, on the other hand, JOVE deter- 
mines that the cost of data movement is compensated by the 
improved load balance, the CFD application is interrupted 
to redistribute the data based on the new partitioning. The 
numerical simulation is then restarted. 

JOVE possesses three novel features. First, a dual graph 
representation of the computational mesh is used to keep 
the complexity and connectivity constant during the course 
of an adaptive computation. Second, a new inertial spectral 
mesh partitioning method [9] is introduced that performs 
both faster and better than Recursive Spectral Bisection [7]. 
Finally, accurate metrics for the computational gain and the 
communication cost are developed to measure and balance 
the processor loads between successive adapted grids. 

2. Background 

2.1. Unstructured tetrahedral mesh adaption 

CFD problems are usually represented as a grid of ver- 
tices and elements. Flowfield solutions are typically stored 
at the vertices while an element represents some compu- 
tation associated with it. During an adaptive calculation, 
the unstructured mesh is locally refined and/or coarsened to 
capture important flow features. The mesh adaption scheme 
used in this work is 3D_TAG [2] which has an edge-based 
data structure; that is, each tetrahedral element is defined 
by its six edges rather than by its four vertices. This edge 
data structure makes the mesh adaption procedure capable 
of performing anisotropic refinement and coarsening. 

At each mesh adaption step, tetrahedral elements are tar- 
geted for coarsening, refinement, or no change by comput- 
ing an error indicator for each edge. Edges whose error 
values exceed a user-specified upper threshold are targeted
for bisection. Similarly, edges whose error values lie be- 
low another user-specified lower threshold are targeted for 
removal. Only three subdivision types are allowed for each 
tetrahedral element. The 1:8 isotropic subdivision is im- 
plemented by adding a new vertex at the mid-point of each 
of the six edges. The 1:4 and 1:2 subdivisions can result 
either because the edges of a parent tetrahedron are targeted 
anisotropically or because they are required to form a valid 
connectivity for the new mesh. When an edge is bisected,
the solution vector is linearly interpolated at the mid-point
from the two points that constitute the original edge.

Mesh refinement is performed by first setting a bit flag to 
one for each edge that is targeted for subdivision in every 
element that shares it. The edge markings for each element 
are then combined to form a binary pattern. Elements whose 
patterns do not match the allowed types are continuously 
upgraded until none of the edges shows any further change. 
Each element is then independently subdivided based on its 
binary pattern. Special data structures are used in order to 
ensure that this process is computationally efficient. 
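The pattern upgrade described above can be illustrated with a small sketch. It is not taken from 3D_TAG: the local edge numbering, the face table, and the function name upgrade_pattern are assumptions made here for illustration, and the upgrade rule (two or more marked edges on one face become 1:4, anything else becomes 1:8) is one plausible reading of "upgraded until the pattern matches an allowed type". The propagation of new edge marks to neighboring elements, which the text says is repeated until no edge changes, is not shown.

```c
/* Hypothetical local numbering for a tetrahedron with vertices 0..3:
 * e0=(0,1) e1=(0,2) e2=(0,3) e3=(1,2) e4=(1,3) e5=(2,3).
 * Bit k of a pattern is 1 when edge k is marked for bisection. */
#define ALL_EDGES 0x3Fu

/* Edges bounding each of the four faces, as bit masks (assumed layout). */
static const unsigned FACE_MASK[4] = { 0x0Bu, 0x15u, 0x26u, 0x38u };

/* Count the marked edges in a 6-bit pattern. */
static int count_marked(unsigned p)
{
    int n = 0;
    for (p &= ALL_EDGES; p != 0u; p >>= 1)
        n += (int)(p & 1u);
    return n;
}

/* Upgrade a pattern to the nearest allowed type:
 * 0 or 1 marked edge is already valid (no change, or 1:2);
 * marks confined to one face are upgraded to that whole face (1:4);
 * everything else is upgraded to all six edges (1:8). */
unsigned upgrade_pattern(unsigned pattern)
{
    int f;
    pattern &= ALL_EDGES;
    if (count_marked(pattern) <= 1)
        return pattern;
    for (f = 0; f < 4; f++)
        if ((pattern & ~FACE_MASK[f]) == 0u)
            return FACE_MASK[f];
    return ALL_EDGES;
}
```

For example, marking e0 and e1 (both on the face with vertices 0, 1, 2) yields the 1:4 pattern for that face, whereas marking the opposite edges e0 and e5 forces the isotropic 1:8 subdivision.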

Mesh coarsening also uses the edge-marking patterns. 
If a child element has any edge marked for coarsening, 
this element and its siblings are removed and their parent 
element is reinstated. The parent edges and elements are 
retained at each refinement step so they do not have to be 
reconstructed. Reinstated parent elements have their edge- 
marking patterns adjusted to reflect that some edges have 
been coarsened. The mesh refinement procedure is then 
invoked to generate a valid mesh. 

2.2. Dynamic load balancing

A parallel implementation of CFD methods on multi- 
processors requires the computational mesh to be divided 
into smaller grids, each of which is then assigned to a pro- 
cessor. The degree of connectivity and the computational 
intensity of individual elements determine how they should 
be grouped to form the subgrids. This partitioning must 
be done in a way that approximately balances the computa- 
tional workload among processors. 

Figure 1 shows how mesh adaption adversely affects 
processor loads. To simplify the presentation, a small two- 
dimensional example is used. The mesh shown in Fig. 1(a) 
consists of 18 triangular elements. Assuming that four pro- 
cessors are used and that the computational intensity is uni- 
form for all elements, the mesh is initially divided into four 
subgrids by applying graph partitioning. Processors P0 and
P1 are assigned five elements each, while processors P2 and
P3 have four elements each. 

Changes in the computational mesh due to adaption
make parallel CFD difficult. As the numerical simulation
progresses, some regions of the grid may contain more
elements due to refinement while other regions may contain
fewer due to coarsening. Figure 1(b) clearly indicates this
after one refinement step. P0 still has 5 elements; however,
P1, P2, and P3 have 13, 12, and 6 elements, respectively.
The mesh adaption will cause P1 and P2 to perform more
than twice the work of P0. Obviously, there is a severe load
imbalance. If another adaption step is performed, the
imbalance is likely to become even more critical, resulting in
poor performance. In the extreme case, the use of a parallel
machine would offer little advantage over a sequential one.
Figure 1. Initial and adapted meshes showing
the need for dynamic load balancing. Also 
shown are the dual graph and the computa- 
tional weights on the adapted mesh. 


As the two snapshots shown in Fig. 1 suggest, it is extremely
important to dynamically repartition the new grid;
however, it is not straightforward as there are many tech- 
nical issues involved. Repartitioning must be quick so that 
there is no significant delay in the CFD calculation. Post- 
partitioning steps must then be able to estimate the compu- 
tational gain and the communication cost to decide whether 
the new partitions are worth accepting. 

3. JOVE: The dynamic load balancing scheme 

3.1. Overview 

It has been shown that dynamic load balancing is ab- 
solutely necessary for unsteady adaptive CFD calculations. 
Figure 2 gives an overview of our approach to dynamic load 
balancing. The system consists of three modules: the load 
balancer JOVE, a CFD flow solver [1, 12], and the 3D_TAG
mesh adaptor [2]. Details of the CFD solver are beyond 
the scope of this paper, except to note that it generates error 
values for each edge that are then used by 3D_TAG to refine
and/or coarsen the mesh. 



Figure 2. Dynamic load balancing framework. 


The first step of JOVE is Pre_eval(new), which determines
if the new mesh warrants further action in terms of
repartitioning and processor reassignment. The objective is
to rapidly decide whether the mesh has changed significantly
enough to consider repartitioning. If Pre_eval(new)
recommends repartitioning, the Partition(new) step
divides the new mesh into subgrids. A new inertial spectral
bisection algorithm [9] is used to rapidly update a partition
from one grid to the next. The Evaluate(old, new)
step consists of assigning partitions to processors such that
the communication cost for data migration is minimized. It
calculates two numbers: the computational gain comp that
would be achieved by having a balanced partitioning, and
the communication cost comm of actually moving all the
data to correctly map partitions to processors. If comp is
larger than comm, it is advantageous to use the new
partitioning. In that case, the CFD simulation is interrupted while
all the necessary data is redistributed based on the processor
assignments. The CFD calculation is then restarted on the
new partitions. Otherwise, the new partitioning is discarded
and JOVE waits for the next adapted mesh.

3.2. Dual graph representation 

The dual graph representation of the initial mesh is one of 
the key features of this work. CFD flow solvers usually solve 
for the solution variables at the vertices of the computational 
mesh. A parallel implementation requires a partitioning of 
the computational mesh such that each element belongs to 
a unique partition. Communication is required across faces 
that are shared by adjacent tetrahedral elements residing on 
different processors. Hence for the purposes of partitioning, 
we consider the dual of the original CFD mesh (cf. Fig. 1).
The tetrahedral elements of the CFD mesh are the vertices 
of the dual graph. An edge exists between two dual graph 
vertices if the corresponding elements share a face in the 
original mesh. A graph partitioning of the dual graph thus 
yields an assignment of tetrahedra to processors. 
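The construction just described can be sketched in a few lines. The function name build_dual_graph, the flat array layout, and the quadratic all-pairs search are assumptions chosen for brevity; the paper does not say how its dual graph is built, and a production code would more likely hash faces to find neighbors in near-linear time.

```c
/* Count the vertex ids shared by two tetrahedra, each given as 4 ids. */
static int shared_vertices(const int *a, const int *b)
{
    int i, j, n = 0;
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            if (a[i] == b[j])
                n++;
    return n;
}

/* Build the dual graph of a tetrahedral mesh.
 * tets:  4*ntets vertex ids, element e occupies tets[4*e .. 4*e+3].
 * edges: output, dual edge k is (edges[2*k], edges[2*k+1]).
 * Returns the number of dual edges found; at most max_edges are stored.
 * Two elements are joined when they share a face, i.e. three vertices. */
int build_dual_graph(const int *tets, int ntets, int *edges, int max_edges)
{
    int i, j, n = 0;
    for (i = 0; i < ntets; i++)
        for (j = i + 1; j < ntets; j++)
            if (shared_vertices(tets + 4 * i, tets + 4 * j) == 3) {
                if (n < max_edges) {
                    edges[2 * n] = i;
                    edges[2 * n + 1] = j;
                }
                n++;
            }
    return n;
}
```

Because the dual graph is built once from the initial mesh, the cost of this step is paid only at start-up, which is why a simple construction may be acceptable for small model meshes.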

Each dual graph vertex has two parameters associated
with it. The computational weight, $w_{comp}$, is a measure
of the workload for the corresponding element of the CFD
mesh. The communication weight, $w_{comm}$, measures the
cost of moving the element from one processor to another.
The connectivity pattern and the $w_{comp}$ determine how dual
graph vertices should be grouped to form partitions that
minimize the disparity in the partition weights. The $w_{comm}$
determine how partitions should be assigned to processors
such that the cost of data movement is minimized.

The most significant advantage of using a dual graph is 
that its complexity and connectivity remain unchanged
during the course of an adaptive computation. This is because
the vertices of the dual graph correspond to the elements of 
the initial CFD mesh. The partitioning and load-balancing 
times therefore depend only on the initial problem size. New 
grids obtained by mesh adaption are translated to the two 
weights, $w_{comp}$ and $w_{comm}$, for every element in the initial
CFD mesh. The normalized $w_{comp}$ values greater than unity
are shown for the dual graph vertices in Fig. 1(b).

3.3. Preliminary evaluation of adapted meshes

The objective of the Pre_eval(new) step in JOVE is
to rapidly determine if the dual graph with a new distribution
of computational weights should be considered for
repartitioning. If projecting the new values of $w_{comp}$ on the current
partitions indicates that they are adequately load balanced,
there is no need to repartition the mesh. In that case, JOVE 
terminates and the CFD application continues uninterrupted 
on the current partitions. 

A proper metric is required to measure the load
imbalance. If $W_{max}$ is the sum of the $w_{comp}$ on the most
heavily-loaded processor, and $W_{avg}$ is the average load across all
processors, the average idle time for each processor is
$(W_{max} - W_{avg})$. This is an exact measure of the load
imbalance. The mesh is repartitioned if the imbalance factor
$W_{max}/W_{avg}$ is greater than a user-specified threshold.
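As a minimal sketch of this test, the imbalance factor can be computed directly from the per-processor loads. The function names and the threshold value used in the example are assumptions for illustration; the paper does not state which threshold was used.

```c
/* loads[p] is the sum of the computational weights on processor p.
 * Returns the imbalance factor W_max / W_avg (1.0 means perfect balance). */
double imbalance_factor(const double *loads, int nproc)
{
    double wmax = loads[0], sum = 0.0;
    int p;
    for (p = 0; p < nproc; p++) {
        sum += loads[p];
        if (loads[p] > wmax)
            wmax = loads[p];
    }
    return wmax / (sum / nproc);
}

/* Preliminary evaluation: repartition only when the imbalance factor
 * exceeds the user-specified threshold. */
int needs_repartition(const double *loads, int nproc, double threshold)
{
    return imbalance_factor(loads, nproc) > threshold;
}
```

With the adapted mesh of Fig. 1(b) and uniform element weights, the loads are 5, 13, 12, and 6, so the average is 9 and the imbalance factor is 13/9, roughly 1.44. Any threshold below that value (1.2, say) would trigger repartitioning.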

3.4. Dynamic inertial spectral mesh partitioning

If the preliminary evaluation step determines that the 
dual graph with a new weight distribution is unbalanced, 
JOVE invokes the mesh partitioning procedure. Several 
partitioning algorithms are available for unstructured grids; 
however, a new procedure that combines the high quality of 
spectral methods [7] with an efficient update strategy is used. 
This dynamic spectral bisection algorithm [9] is based on the 
center of inertia of the unpartitioned dual graph vertices and 
utilizes information from the initial spectral partitioning. It 
is thus capable of rapidly updating a partition from one grid 
to the next. The following algorithm explains the method: 

for (i = 0; i < log(npart); i++)       /* npart: #partitions */
    for (j = 0; j < 2^i; j++) {
        Find an inertial vector of the unpartitioned vertices
        Construct an inertial matrix using the inertial vector
        Symmetrize the inertial matrix
        Find the eigenvectors of the inertial matrix
        Project vertex coordinates on eigenvector 0
        Sort projected coordinates
        Divide the unpartitioned vertices into two sets
    }
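The body of the loop above follows the general pattern of inertial bisection, and a single bisection step can be sketched as follows. This is plain two-dimensional inertial bisection under stated assumptions, not the authors' dynamic spectral variant [9]: the weighted covariance matrix stands in for the "inertial matrix", a fixed-count power iteration stands in for the eigensolver, and the names inertial_bisect and Proj are hypothetical. All weights are assumed positive and n is assumed to be at least 1.

```c
#include <stdlib.h>

typedef struct { double key; int id; } Proj;

/* Order projected vertices by increasing coordinate along the axis. */
static int cmp_proj(const void *a, const void *b)
{
    double ka = ((const Proj *)a)->key, kb = ((const Proj *)b)->key;
    return (ka > kb) - (ka < kb);
}

static double absd(double v) { return v < 0.0 ? -v : v; }

/* Split n weighted 2-D points into two sets of roughly equal weight.
 * side[i] is set to 0 or 1. Returns 0 on success, -1 if malloc fails. */
int inertial_bisect(const double *x, const double *y, const double *w,
                    int n, int *side)
{
    double wt = 0.0, cx = 0.0, cy = 0.0;
    double ixx = 0.0, ixy = 0.0, iyy = 0.0;
    double vx = 1.0, vy = 0.5, acc = 0.0;
    int i, it;
    Proj *p = malloc((size_t)n * sizeof *p);
    if (p == NULL)
        return -1;

    /* Weighted center of the vertices. */
    for (i = 0; i < n; i++) { wt += w[i]; cx += w[i] * x[i]; cy += w[i] * y[i]; }
    cx /= wt; cy /= wt;

    /* Symmetric 2x2 matrix of weighted second moments about the center. */
    for (i = 0; i < n; i++) {
        double dx = x[i] - cx, dy = y[i] - cy;
        ixx += w[i] * dx * dx; ixy += w[i] * dx * dy; iyy += w[i] * dy * dy;
    }

    /* Power iteration for the dominant eigenvector (principal axis). */
    for (it = 0; it < 100; it++) {
        double nx = ixx * vx + ixy * vy, ny = ixy * vx + iyy * vy;
        double m = absd(nx) > absd(ny) ? absd(nx) : absd(ny);
        if (m == 0.0)
            break;
        vx = nx / m; vy = ny / m;
    }

    /* Project on the axis, sort, and cut at half of the total weight. */
    for (i = 0; i < n; i++) {
        p[i].key = (x[i] - cx) * vx + (y[i] - cy) * vy;
        p[i].id = i;
    }
    qsort(p, (size_t)n, sizeof *p, cmp_proj);
    for (i = 0; i < n; i++) {
        int id = p[i].id;
        side[id] = (acc + 0.5 * w[id] < 0.5 * wt) ? 0 : 1;
        acc += w[id];
    }
    free(p);
    return 0;
}
```

Applying such a step recursively to each half, for log2(npart) levels, gives npart partitions, which matches the nesting of the two loops in the algorithm above.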

3.5. Similarity metric construction for evaluation 

The objective of the evaluation step is to map new par- 
titions to processors such that the communication cost for 
redistributing data is minimized. It begins by computing a 
similarity measure $S$ that indicates how the communication
weights of the new partitions are distributed over the old
partitions. It is represented as a matrix where $S_{ij}$ is the sum
of the communication weights of all the dual graph vertices
that have moved from old partition i to new partition j.

Consider, for example, a dual graph that generates the 
measure S in Fig. 3(a) after a repartitioning among eight 
processors. Only the non-zero entries are shown. Note 
that there are only three non-zero entries in the first row. 
This means that the vertices in old partition 0 have been 
distributed over new partitions 0, 1, and 3. Also, it would 
cost 389 to move those vertices in old partition 0 that are 
common to new partition 0, 510 to move those that are 
common to new partition 1, and 120 to move those that are
common to new partition 3. 
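Building the similarity matrix amounts to one pass over the dual graph vertices. The sketch below assumes a row-major npart x npart array and per-vertex arrays of old and new partition numbers; these names and the layout are illustrative assumptions, not the data structures of the actual JOVE code.

```c
/* Accumulate the similarity matrix S (npart x npart, row-major).
 * old_part[v] and new_part[v] are the partition numbers of dual graph
 * vertex v before and after repartitioning, and wcomm[v] is its
 * communication weight. On return, S[i*npart + j] holds the total
 * communication weight of vertices that went from old partition i
 * to new partition j. */
void build_similarity(const int *old_part, const int *new_part,
                      const double *wcomm, int nvert, int npart, double *S)
{
    int v, k;
    for (k = 0; k < npart * npart; k++)
        S[k] = 0.0;
    for (v = 0; v < nvert; v++)
        S[old_part[v] * npart + new_part[v]] += wcomm[v];
}
```

Each row of S then sums to the total communication weight of one old partition, which is the property the example with entries 389, 510, and 120 relies on.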

3.6. Processor reassignment 

A new partition j with the largest value of $S_{ij}$ is called
the dominant partition for old partition i. This is because 
the communication cost for moving data can be minimized 
by mapping the processor assigned to an old partition to 
its corresponding dominant partition. The shaded entries in 
Fig. 3(a) indicate the largest computational weight for each 
of the old partitions. These are called the dominant weights. 
A serious problem is evident by inspecting the dominant
weights in Fig. 3(a). Even though every old partition has
a dominant partition, not every new partition is necessarily
dominant. This affects the new partitions in two ways. First,
some new partitions are not dominant at all; their processor
assignment entries are marked with an 'X'. Second, some
new partitions are dominant for more than one old partition;
their processor assignment entries are marked with a '?'.
Our goal is to assign each processor a unique partition.
Thus, the dominant partitions need to be rearranged so that
there is exactly one dominant weight in every row and
column of the similarity matrix $S$. Processor assignment
then simply consists of mapping each dominant partition to
the processor to which the old partition was originally
assigned. However, this rearrangement constitutes a difficult
optimization problem [10]. Due to runtime constraints, a
suboptimal solution is obtained in linear time. The following
algorithm ensures that each new partition is designated
as dominant for exactly one old partition:

for (i = 0; i < npart; i++)            /* npart: #partitions */
    for (j = 1; j < ndp[i]; j++) {     /* ndp[i]: #dom wghts */
        Find min dominant weight S_li from new partition i
        Find max non-dominant weight S_lk from old partition l
            such that ndp[k] < 1
        Mark S_li non-dominant and S_lk dominant
        ndp[k] = 1
    }

Figure 3. The similarity matrix (a) before and
(b) after processor reassignment.

The inner loop is executed only for those partitions that
have more than one dominant weight. Applying the above
algorithm to the similarity matrix in Fig. 3(a) generates the
new processor assignment shown in Fig. 3(b). In general,
our method is also applicable if the number of partitions is
an integer multiple of the number of processors.
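The reassignment heuristic can be sketched as below, assuming one partition per processor. The function name reassign, the arrays dom and ndp, and the row-major layout of S are assumptions for illustration. The sketch also favors clarity over speed: it rescans rows and columns on every conflict, so it does not achieve the linear running time the text reports for the actual implementation.

```c
/* S:   npart x npart similarity matrix, row-major.
 * dom: output, dom[l] is the new partition assigned to old partition l.
 * ndp: scratch, ndp[k] counts the old partitions that currently claim
 *      new partition k as dominant. */
void reassign(const double *S, int npart, int *dom, int *ndp)
{
    int i, k, l;

    /* Each old partition first claims the new partition with the
     * largest weight in its row. */
    for (k = 0; k < npart; k++)
        ndp[k] = 0;
    for (l = 0; l < npart; l++) {
        dom[l] = 0;
        for (k = 1; k < npart; k++)
            if (S[l * npart + k] > S[l * npart + dom[l]])
                dom[l] = k;
        ndp[dom[l]]++;
    }

    /* Resolve new partitions that are claimed more than once. */
    for (i = 0; i < npart; i++)
        while (ndp[i] > 1) {
            int lmin = -1, kmax = -1;
            /* Claimant with the smallest dominant weight gives way. */
            for (l = 0; l < npart; l++)
                if (dom[l] == i &&
                    (lmin < 0 || S[l * npart + i] < S[lmin * npart + i]))
                    lmin = l;
            /* It takes its best still-unclaimed new partition instead. */
            for (k = 0; k < npart; k++)
                if (ndp[k] == 0 &&
                    (kmax < 0 || S[lmin * npart + k] > S[lmin * npart + kmax]))
                    kmax = k;
            dom[lmin] = kmax;
            ndp[kmax] = 1;
            ndp[i]--;
        }
}
```

An unclaimed partition always exists while some partition is claimed twice, since the claim counts sum to npart, so both searches succeed and the loop terminates with a one-to-one assignment.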

3.7. Computational gain vs. communication cost 

The computational gain of repartitioning is proportional
to the decrease in the load imbalance achieved by running
the adapted mesh on the new partitions rather than on the
old partitions. Recall from Sec. 3.3 that the average load
imbalance for each processor is given by $(W_{max} - W_{avg})$.
The decrease in the amount of load imbalance due to the new
partitioning on $P$ processors is therefore $P(W_{max}^{old} - W_{max}^{new})$,
where $W_{max}^{old}$ and $W_{max}^{new}$ are the sum of the computational
weights on the most heavily-loaded processor for the old
and new partitionings, respectively. If it requires $T_{iter}$ μsecs
to run one iteration of the CFD flow solver on one element
of the original mesh, and if it is expected that the next
mesh adaption will occur after $N_{adapt}$ iterations of the flow
solver, the total computational gain for the new partitioning
is $P T_{iter} N_{adapt} (W_{max}^{old} - W_{max}^{new})$.

Calculating the communication cost is more complicated.
The similarity matrix obtained after processor reassignment
determines how data is to be redistributed. Models such as
LogP [3] capture communication behavior with various
parameters. We, however, use a model based on the similarity
matrix and two machine-dependent parameters: the
remote-memory latency time $T_{lat}$ and the message setup time $T_{setup}$.
$T_{lat}$ is the time required for memory-to-memory copying of
a word, and applies to every dual grid vertex that is moved.
$T_{setup}$ is the time required to prepare message headers, load
the message buffer, and so on, and applies to each set of
vertices that is moved from one processor to another.

Consider the similarity matrix in Fig. 4. Old partition 0
is distributed over new partitions 0, 1, and 3. However, data
has to be moved only to partitions 0 and 3 because new
partition 1 is assigned to P0, the same processor that old partition
0 was assigned to. This means that a total of 509
computational elements have to be moved from P0. Moreover, since
the elements have to be sent to P4 and P7, the setup time
for moving two sets of data also has to be included in the
total cost. If the CFD and mesh adaption algorithms require
$M$ words of storage per computational element, and if $C$
and $N$ are the total number of elements and sets of elements
to be moved, respectively, the total communication cost for
mapping new partitions to processors is $C M T_{lat} + N T_{setup}$.

Figure 4. Calculating the total communication
cost from the similarity matrix. (Axes: new partitions versus
processors; totals shown in the figure: C = 4200, N = 18.)

The new partitioning and mapping are accepted if the 
computational gain is greater than the communication cost. 
The numerical simulation is then interrupted to properly 
redistribute all the data based on the processor reassignment 
obtained from the similarity matrix. This completes the load 
balancing phase for one mesh adaption step. 
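The acceptance test of this section can be written out as a short sketch. The function names are hypothetical, old partition i is assumed to reside on processor i, and the entries of S are treated as element counts, as in the 509-element example above. The timing parameters must be supplied in consistent units; no measured values for the SP2 are implied.

```c
/* Gain: P * T_iter * N_adapt * (W_max_old - W_max_new). */
double computational_gain(int nproc, double t_iter, int n_adapt,
                          double wmax_old, double wmax_new)
{
    return nproc * t_iter * n_adapt * (wmax_old - wmax_new);
}

/* Cost: C * M * T_lat + N * T_setup, where C is the number of elements
 * that change processor and N is the number of (source, destination)
 * sets that must be sent.
 * S:       npart x npart similarity matrix after reassignment, row-major.
 * proc_of: proc_of[j] is the processor assigned to new partition j. */
double communication_cost(const double *S, const int *proc_of, int npart,
                          double words_per_elem, double t_lat, double t_setup)
{
    double c = 0.0;
    int n = 0, i, j;
    for (i = 0; i < npart; i++)
        for (j = 0; j < npart; j++)
            if (proc_of[j] != i && S[i * npart + j] > 0.0) {
                c += S[i * npart + j];   /* elements leaving processor i */
                n++;                     /* one more message to set up   */
            }
    return c * words_per_elem * t_lat + n * t_setup;
}

/* The new partitioning is accepted only if the gain exceeds the cost. */
int accept_repartition(double gain, double cost)
{
    return gain > cost;
}
```

For a row with entries 389, 510, and 120 in which the 510 entry stays on its own processor, the sketch counts C = 509 and N = 2 for that row, in agreement with the worked example.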

4. Results and discussions 

4.1. JOVE implemented on SP2 

The load balancer JOVE, as described in Sec. 3, has been 
implemented on the IBM SP2 distributed-memory
multiprocessor installed at NASA Ames Research Center. The code
consists of approximately 3000 lines of C, with the parallel
activities implemented in Message-Passing Interface (MPI) 
for portability. This does not include the 3D_TAG mesh 
adaption procedure which is another 4000 lines of C code. 
A master-worker parallel programming paradigm is used to 
simplify the implementation. 

4.2. Test meshes and adaption simulation 

Two model unstructured meshes are used in the experiments
reported in this paper. The first mesh, called PARC, is
two dimensional and has 1240 triangular elements. The sec- 
ond mesh, called BRICK, is three dimensional and has 2500 
tetrahedral elements. Both are very small meshes, suitable 
for investigating fundamental issues in load balancing with 
reasonable execution times. Using realistic CFD meshes 
consisting of about a million elements would unnecessarily 
hinder our investigations as they have long execution times 
even on large-scale machines. Small meshes, on the other 
hand, allow us to look into the behavior of the load balancer 
in a reasonable time frame with a wide range of different 
parameters and settings. 

The actual mesh adaption procedure has been simulated
in parallel while retaining its typical behavior. Two
fundamental issues need to be addressed in the simulation of mesh
adaption: vertex selection and adaption modeling. Vertex
selection refers to how and when dual graph vertices are
selected as candidates for adaption. Vertices are randomly
selected for adaption regardless of their partition number. At
each iteration, a vertex is adapted if its id modulo a
pseudo-random number lies within a certain range. Adaption
modeling refers to how much computation each vertex should
perform. Our adaption simulator is defined as three nested
loops with the innermost consisting of a floating point
division. Each loop has $w$ iterations, where $w$ is the weight of
the vertex. Therefore, if a vertex with weight $w$ is selected
for adaption, its weight is set to $w^3$ and it goes through $w^3$
iterations of floating point divisions. We have done substantial
mesh adaption on realistic meshes in the past [5, 11] and
find that this model is suitable for our experiments.
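The adaption model just described is simple enough to sketch directly. The function names, the divisor constant, and the inclusive range test are assumptions for illustration; the text does not specify the range or how the pseudo-random number is drawn, so the caller is assumed to supply a positive value r.

```c
/* Vertex selection: a vertex is adapted when its id modulo the
 * pseudo-random number r falls within [lo, hi]. Assumes r > 0. */
int is_selected(int id, int r, int lo, int hi)
{
    int m = id % r;
    return m >= lo && m <= hi;
}

/* Adaption modeling: three nested loops of w iterations each, with one
 * floating point division in the innermost loop. Returns the number of
 * divisions performed, which is w*w*w and becomes the new weight.
 * The running value is written to *result so that the divisions are
 * not removed as dead code by an optimizing compiler. */
long simulate_adaption(long w, double *result)
{
    long i, j, k, count = 0;
    double x = 1.0;
    for (i = 0; i < w; i++)
        for (j = 0; j < w; j++)
            for (k = 0; k < w; k++) {
                x = x / 1.0000001;
                count++;
            }
    if (result)
        *result = x;
    return count;
}
```

A vertex of weight 3 therefore performs 27 divisions and carries weight 27 afterwards, which shows how quickly the cubic model concentrates work on the adapted vertices.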

4.3. Anatomy of the execution time

We discuss how and where the total execution time is 
spent for each mesh adaption step. For typical, unsteady 
CFD calculations, the mesh adaption and load balancing 
phases are invoked several hundred times. However, it suf- 
fices to investigate for some reasonable number of adaptions 
to understand the behavior of the whole system. The execu- 
tion time is measured for various steps and summarized into 
four categories: adaption, partitioning, evaluation/decision, 
and communication. Note that the communication time is
a combination of several activities that include sending and
receiving weights, and redistributing dual graph vertices 
among processors. Figure 5 shows the execution time pro- 
file for the first 30 adaptions on the BRICK mesh using 16 
and 64 processors. 



Figure 5. Anatomy of total execution time for 
BRICK mesh. 

We can draw three major conclusions from the plots. 
First, we find that the partitioning and evaluation times are 
small compared to the adaption and communication times. 
It is also noteworthy that the partitioning and evaluation 
times remain constant throughout the simulation. However, 
as expected, the partitioning time increases with the number 
of processors. For example, the partitioning time is about 
0.1 secs for 16 processors, but increases to 0.25 secs for 64 
processors. This is not surprising because the master pro- 
cessor needs more time to partition the grid into 64 subgrids 
than into 16 subgrids. 

Second, the mesh adaption and communication times 
dominate the total execution time. In particular, the adaption 
time is dominant when the number of processors is small, 
as seen in Fig. 5(a). There is an order-of-magnitude difference between
adaption and communication times. However, with 64 pro- 
cessors, the two times are comparable (cf. Fig. 5(b)). This 
trend is expected to continue as the number of processors 
increases; that is, the communication time will dominate 
when more processors are used. However, this is not alarm- 
ing because the adaption time is artificially very small for 
the model problems. The typical execution time for one 
mesh adaption step for realistic problems is a few hundred
seconds, not fractions of a second [5]. Thus, in real appli- 
cations, the adaption time will almost always be much more 
than the communication time. The plots in Fig. 5 indicate 
that for most parallel applications, an increase in the num- 
ber of processors will substantially lower the adaption time 
while increasing the communication time. 

Finally, we have analyzed the execution time for only 
30 adaptions. Full-scale, unsteady applications typically 
require several hundred mesh adaption steps. As observed 
from the plots, the execution time relentlessly increases as 
the number of adaptions increases. 

4.4. Effect of mesh adaption on data movement 

Figure 6 shows the percentage of dual graph vertices that 
are moved at runtime after each adaption. We present results 
for both the PARC and the BRICK meshes. 



Figure 6. Percentage of dual graph vertices 
that are moved. 


The plots demonstrate that the PARC mesh incurs a lot 
more relative data movement than the BRICK mesh. This 
is because BRICK has more vertices than PARC. We also 
find that the amount of data movement increases with the 
number of processors for both meshes. This explains the 
increase in the communication time in Fig. 5. 



Figure 7. Comparison of total execution times 
with and without load balancing. 


4.5. Impact of dynamic load balancing 

Two sets of experiments were performed to measure the 
effectiveness of JOVE. These represent the key results of 
this paper. The same vertex selection and adaption modeling 
procedures were used with and without load balancing. Fig- 
ure 7 illustrates the impact of load balancing on the total exe- 
cution time. The plots show that when 8 processors are used 
for the BRICK mesh, the load balancing gives more than a 
threefold improvement over no load balancing. However, 
with 64 processors, the improvement is almost sixfold. In 
general, the results demonstrate that load balancing is highly 
nondeterministic but shows some gain for BRICK when the 
number of processors increases. This improvement is not 
observed for PARC primarily because it is a very small 
problem. We expect larger improvement for bigger meshes 
because of the increased computation-to-communication ratio.

Figure 8 demonstrates the implication of this perfor- 
mance improvement when the load balancer JOVE is used 
with mesh adaption. When compared with the sequential 
version, JOVE demonstrates a 24-fold speedup for 40 adap- 
tion steps. For 10 adaptions, which is quite unrealistic, the 
speedup is only about 10. We also see a typical phenomenon 
of early saturation. However, the speedup consistently in- 
creases with the number of adaptions. For real problems 
with several hundred adaption steps, the speedup will in- 
crease further as the curves in Fig. 8 suggest. 

Figure 8. Parallel speedup of JOVE. 

5. Conclusions 

Dynamic load balancing for unstructured adaptive mesh 
computations is a complex task, involving many procedures 
and parameters. While typical load balancing schemes lo- 
cally exchange information between neighboring proces- 
sors, we have presented a new method called JOVE that 
dynamically balances loads across processors with a global 
view. JOVE has been implemented on an SP2 distributed- 
memory multiprocessor with approximately 3000 lines of 
C code. Parallel activities have been implemented in MPI 
for portability. We have used two model meshes for exper- 
iments: one with 1240 elements, and the other with 2500 
elements. While these meshes are small, they are suitable 
for our investigations as the execution times are reasonable. 

Two key observations can be made from the experiments 
reported in this paper. First, the JOVE load balancing mod- 
ule has given a sixfold improvement for mesh adaption, 
when compared with no balancing regardless of the number 
of processors. Second, JOVE has given a 24-fold speedup 
on 64 processors, when compared with a sequential single- 
processor version that has no parallel constructs. These 
observations are based on the measurement of only 30 mesh 
adaption steps. Results have indicated that performance will 
improve with more adaptions and larger meshes.

We have also drawn some other conclusions that clarify 
the behavior of load balancing for mesh adaption. First, 
the partitioning and evaluation times are negligible
compared to the adaption and communication times, regardless
of the number of processors. This has indicated that even 
the sequential version of the new inertial spectral partitioner 
is indeed quite fast. Second, the adaption time decreases 
while the communication time increases as the number of 
processors is increased. This is somewhat expected, and our 
future efforts will be focused on reducing the communica- 
tion time. Finally, the number of vertices that are moved 
due to repartitioning does not appear to be a key factor that 
affects the effectiveness of load balancing. We have found 
that with 64 processors, performance still sustained a sixfold 
improvement even when 25% of all vertices were moved. 

These experimental results have consistently demon- 
strated that the JOVE load balancer is effective for unstruc- 
tured adaptive mesh computations. Our immediate goal is 
to run JOVE on large meshes with several hundred adap- 
tion steps that closely model full-scale experiments. We are 
planning on applying this method to various realistic appli- 
cations including helicopter aerodynamics, semiconductor 
device modeling, and computational nanotechnology. 